Actuarial Data Science Applications (ACTL4305/5305)
dplyr (a core member of tidyverse) for data manipulation and transformation1nycflights13 package, flights departing New York City in 2013ggplot2 to help us understand the data.
int stands for integers.dbl stands for doubles, or real numbers.chr stands for character vectors, or strings.dttm stands for date-times (a date + a time).lgl stands for logical, vectors that contain only TRUE or FALSE.fctr stands for factors, which R uses to represent categorical variables with fixed possible values.date stands for dates.Functions in dplyr package
%>% Pipe operatorglimpse() A glimpse into the data and its structurefilter() Pick observations by their valuesarrange() Reorder the rowsselect() Pick variables by their namessummarise() Collapse many values down to a singlegroup_by() Changes the scope of each function above from operating on the entire dataset to operating on it group-by-group
To use filtering effectively, you have to know how to select the observations that you want using the comparison operators. R provides the standard suite:
> bigger than>= bigger than or equal to< less than<= less than or equal to!= not equal== equal1filter() are combined with “and”: every expression must be true in order for a row to be included in the output.Other types of combinations using Boolean operators:
& is “and”| is “or”! is “not”x %in% y select every row where x is one of the values in yAccording to De Morgan’s law,
!(x & y) is the same as !x | !y!(x | y) is the same as !x & !yNA represents an unknown value so missing values are “contagious”: almost any operation involving an unknown value will also be unknown.filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly:arrange(): Order by column names (or more complicated expressions)desc() to re-order by a column in descending orderSelectselect() allows you to select a useful subset based on the names of the variables.Select: Useful functionsstarts_with("abc"): matches names that begin with “abc”.ends_with("xyz"): matches names that end with “xyz”.contains("ijk"): matches names that contain “ijk”.matches(): selects variables that match a regular expression.everything()Other related functions:
rename(): rename variablesmutate(): create new variables with functions of existing variablesSummariseSummarise: with group-byPlease have a try!
by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
delay <- filter(delay, count > 20, dest != "HNL")
# It looks like delays increase with distance up to ~750 miles
# and then decrease. Maybe as flights get longer there's more
# ability to make up delays in the air?
plot=ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'%>%na.rm=TRUE removes the missing valuesmean(x), median(x)sd(x), IQR(x)1min(x), quantile(x, 0.25), max(x)first(x), nth(x, 2), last(x)n(), sum(!is.na(x)), n_distinct(x)
count(tailnum, wt = distance), “count” (sum) the total number of miles a plane flewsum(x > 10), mean(y == 0)Three families of verbs designed to work with relational data:
Mutating joins: add new variables to one data frame from matching observations in another.
Filtering joins: filter observations from one data frame based on whether or not they match an observation in the other table.
Set operations: treat observations as if they were set elements.
Other similar database system: SQL
nycflights13 contains five tibbles that are related to each other:
flights: gives information about each flightairlines: lets you look up the full carrier nameairports: gives information about each airport, identified by the faa airport codeplanes: gives information about each plane, identified by its tailnumweather: gives the weather at each NYC airport for each hourRelations of Tibbles
planes$tailnum is a primary key because it uniquely identifies each plane in the planes table.flights$tailnum is a foreign key because it appears in the flights table where it matches each flight to a unique plane.count() the primary keys and look for entries where n is greater than one.flights table?
mutate() and row_number()flights %>%
arrange(year, month, day, sched_dep_time, carrier, flight) %>%
mutate(flight_id = row_number()) %>%
#TThis makes it possible to see every column in a data frame.
glimpse()Rows: 336,776
Columns: 20
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time <int> 517, 533, 542, 544, 554, 559, 558, 559, 558, 558, 557, …
$ sched_dep_time <int> 515, 529, 540, 545, 558, 559, 600, 600, 600, 600, 600, …
$ dep_delay <dbl> 2, 4, 2, -1, -4, 0, -2, -1, -2, -2, -3, NA, 1, 0, -5, -…
$ arr_time <int> 830, 850, 923, 1004, 740, 702, 753, 941, 849, 853, 838,…
$ sched_arr_time <int> 819, 830, 850, 1022, 728, 706, 745, 910, 851, 856, 846,…
$ arr_delay <dbl> 11, 20, 33, -18, 12, -4, 8, 31, -2, -3, -8, NA, -6, -7,…
$ carrier <chr> "UA", "UA", "AA", "B6", "UA", "B6", "AA", "AA", "B6", "…
$ flight <int> 1545, 1714, 1141, 725, 1696, 1806, 301, 707, 49, 71, 79…
$ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N39463", "N708…
$ origin <chr> "EWR", "LGA", "JFK", "JFK", "EWR", "JFK", "LGA", "LGA",…
$ dest <chr> "IAH", "IAH", "MIA", "BQN", "ORD", "BOS", "ORD", "DFW",…
$ air_time <dbl> 227, 227, 160, 183, 150, 44, 138, 257, 149, 158, 140, N…
$ distance <dbl> 1400, 1416, 1089, 1576, 719, 187, 733, 1389, 1028, 1005…
$ hour <dbl> 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6…
$ minute <dbl> 15, 29, 40, 45, 58, 59, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
$ flight_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
inner_join(): keeps observations that appear in both tables.left_join(): keeps all observations in x.right_join(): keeps all observations in y.full_join(): keeps all observations in x and y.by=NULL: uses all variables that appear in both tables, the so called natural joinby = "x": uses only some of the common variables.by = c("a" = "b"): match variable a in table x to variable b in table ybase::merge()| dplyr | merge |
|---|---|
| inner_join(x, y) | merge(x, y) |
| left_join(x, y) | merge(x, y, all.x = TRUE) |
| right_join(x, y) | merge(x, y, all.y = TRUE) |
| full_join(x, y) | merge(x, y, all.x = TRUE, all.y = TRUE) |
flights intersect(x, y): return only observations in both x and y.union(x, y): return unique observations in x and y.setdiff(x, y): return observations in x, but not in y.